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Abstract. There are variety of computational algorithms need sequential sweeping; sweeping 
based on specific order; on a structured grid, e.g., preconditioning (smoothing) by SOR or ILU 
methods and solution of cikonal equation by fast sweeping algorithm. Due to sequential nature, 
parallel implementation of these algorithms usually leads to miss of efficiency; e.g. a significant 
convergence rate decay. Therefore, there is an interest to parallelize sequential sweeping procedures, 
keeping the efficiency of the original method simultaneously. This paper goals to parallelize sequential 
sweeping algorithms on structured grids, with emphasis on SOR and ILU preconditioners. The 
presented method can be accounted as an overlapping domain decomposition method combined to 
a multi-frontal sweeping procedure. The implementation of method in one and two dimensions are 
discussed in details. The extension to higher dimensions and general structured n-diagonal matrices is 
outlined. Introducing notion of alternatively block upper-lower triangular matrices, the convergence 
theory is established in general cases. Numerical results on model problems show that, unlike related 
alternative parallel methods, the convergence rate and efficiency of the presented method is close 
to the original sequential method. Numerical results also support successful use of the presented 
method as a cache efficient solver in sequential computations as well. 

Key words, domain decomposition, overlapping decomposition, parallel GaussSeidel, parallel 
sweeping. 
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1. Introduction. Structured grids are commonly used in scientific computing 
for the purpose of numerical solution of partial differential equations (PDEs) . Despite 
the unstructured grid based methods which manage domain complexity in a more 
consistent manner, in certain applications using structured grids is still the best choice, 
mainly due to preference in terms of consumed memory, efficiency and the ease of 
implementation. Moreover, with the invention of immersed boundary methods with 
make possible to manage complex geometries on structured grids, there is recently a 
significant interest to application of structured grids in scientific computing. Progress 
in power of supercomputer changes the bias in favor of structured grids too, as they 
are usually more appropriate for parallel computing. 

Some of numerical algorithms on structured grids use sequential sweeping; grid 
by grid procedure on a mandatory order. In fact, using the conventional version 
of sequential algorithms, it is not possible to perform computations simultaneously. 
This limit us to use full power of multi-processor shared and/or distributed memory 
machines. 

Since the advent of parallel computers, several parallel versions of the SOR 
method have been developed. Most variants of these algorithms have been devel- 
oped by using multi-color ordering schemes or domain decomposition techniques 
[2, 13, 1, 17, 27, 14, 15, 5, 7, 19, 11, 23, 29]. In contrast to domain decomposi- 
tion method, a multi-color SOR method has usually a faster convergence rate, but is 
usually more difficult to solve elliptic boundary value problems in complex domains 
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[28]. To improve the convergence rate of a domain decomposition algorithms, a novel 
mesh partition strategy was proposed in [29] which led to an efficient parallel SOR. 
This method has the same asymptotic rate of convergence as the Red-Black SOR 
[29]. Although, multi-color SOR algorithms are also usually implemented based on a 
domain decomposition strategy, the Xie and Adams's algorithm [29] takes less inter- 
processor data communication time than a multi-color SOR. The blocked version of 
the Xie and Adams's parallel SOR algorithm [29] is suggested in Xie [28]. According 
to presented results, the rate of convergence and parallel performance of this version 
is better than that of the non-blocked version. 

The main drawback of the above mentioned parallel versions of SORs is lower 
rate of convergence in contrast to original one. This is mainly due to enforced decou- 
pling between equations to provide some degree of concurrency in computations. For 
example in the m-color SOR method, an increase in the number of colors decreases 
the rate of convergence. So a key to achieve an efficient parallel SOR method is to 
minimize the degree of decoupling due to parallelization. 

Other techniques such as the pipelining of computation and communication and 
an optimal schedule of a feasible number of processors are also applied to develop par- 
allel SOR methods when the related coefficient matrix is banded [16, 20, 10, 18, 4, 3]. 
These techniques can isolate parts of the SOR method which can be implemented in 
parallel without changing the sequential SOR. The convergence rate of these meth- 
ods are same as the original one but the application is limited to banded-structure 
matrices. 

The present goals to suggest a new method to parallelize sequential sweeping on 
structured grids. The specific focus is on the SOR method which can be regarded as an 
iterative preconditioner, smoother or even a linear solver. Without loss of generality, 
the later case is taken into account in this study to simplify arguments. Since there is 
a direct correspondence between SOR and ILU preconditioners on structured grids, 
the presented method can be used to develop parallel ILU methods as well. 

The rest of this paper is organized as follows. Sections 2 and 3 present the imple- 
mentation of method in one and two spatial dimensions respectively. The extension 
to three spatial dimensions is commented in Section 4. he Considering the connec- 
tion between SOR and ILU preconditioners on structured grids, Section 5 argues the 
application of presented approach to develop ILU(p) preconditioners on structured 
grids. The convergence theory is established in Section 6. Section 7 extend utility of 
the presented method to general structured n-diagonal matrices. Section 8 is devoted 
to numerical results on the convergence and performance of the method. Finally Sec- 
tion 9 close this paper by summarizing results and outlining possible application and 
extension of the presented method. 

2. Parallel SOR sweeping in one-dimension. Consider the following scaler 
ID elliptic PDE: 



where CI = [0,L X ], T = dCl, a : Cl ->• R+ is C\ fi e R+ U {0}. Also, it is assumed 
field variables u, uq and / posses the sufficient regularity required by the applied 
numerical method to ensure the existence and uniqueness of the discrete solution. 
Assume the spatial domain is discrtized into an n + 1 intervals with grid points 
Xi = Xi-i + Sxi-i, i = 1, ... ,n and xq = 0, x n+ \ = L x , where Sxi = Xi+x — X4. 




(2.1) 
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Without loss of generality; for the ease of presentation; we discrtize (2.1) with a 
second order finite difference method as follows: 



-CiUi-i+biUi- diUi+i — fi, i = l,...,n. (2.2) 
where subscript i denotes the value of field variable at x = x, (e.g. Ui — u(xi)), 
2a i+1/2 _ 2a l _ 1/2 



5xi(6xi + Sxi-i) ' 



, bi = di + Ci+ {), 



"i+l/2 = 



2ct;a 



OLi + a 



i+i 



") Ot-l/2 



2QiQ;j„i 



. Using boundary conditions ito = wo(xo) and = Uo(a?n+i)> h is possible to write 
above equations in the following matrix form, 



Au 



(2.3) 



A = 



h -ai 
-c 2 b 2 

-C3 



-a 2 



6„ 



where u = [tt X! tt 2 , . . . , u n ] T and f = [ciu + /i, / 2 , ■ ■ ■ , f n -i, fn + a n u n+1 ] T . It is 
evident that matrix A is irreducibly diagonally dominant which ensure the convergence 
of stationary iterative methods, like SOR. The SOR method using the natural row- 
wise ordering (left to right sweeping) generates the following sequence of iterations 
from a given initial guess u 



(0) 



(fc+l) _f. x (fc) W£ (fe+1) 



/i), 



I = 1, 



. ,n, 



(2.4) 



where ujl £ (0, 2) is left to right (LR) sweeping relaxation factor and superscript 
(k) denotes the iteration number. The other alternative SOR method is resulted by 
reversing the sweeping direction (right to left sweeping): 



u i - I 1 - W «K + -r-(CtU i _ 1 + atu i+1 



ft), 



1. 



(2.5) 



where € (0, 2) is the right to left (RL) sweeping relaxation factor. The applica- 
tion of (2.4) and (2.5) alternatively during each two consecutive iterations leads to 
the classical symmetric SOR (SSOR) method. It is clear that these procedures are 
sequential, in the sense that update of grid point i should be performed after its left 
(or right) neighbor. In fact, it is not possible to perform relaxation of more than one 
grid point at each time in the original version of SOR method. 

The goal of parallel implementation of SOR method is to provide possibility of 
concurrent computation, simultaneously keep the convergence rate of original SOR 
method as much as possible. The concept of domain decomposition is used here to 
parallelize the SOR method. Assume spatial domain fi is decomposed into p number 
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of non-overlapping sub-domains, i.e., Q — Uf =i r^, and QiDi^jClj — 0, for i = 1, . . . ,p 
(splitting of grid points). 

For the purpose of convenience, we first describe our method for two (approxi- 
mately equal) sub-domains. As illustrated in figure 2.1, discrtized spatial domain is 
decomposed into sub-domains f^i and i^, where f^i = [0, x m ] and S^2 = [x m +i,L]. If 
(2.4) is used in f2i and the computation begins from the left boundary, and simulta- 
neously, (2.5) is used in J7 2 and the calculation is performed from the right boundary 
toward left direction, then the computation of sub-domains will be independent dur- 
ing the first iteration. In our algorithm during the next iteration (2.5) and (2.4) are 
used respectively in J7i and O2 ; and the computations are started from grid points 
x m and x m +i in the corresponding sub-domains. Therefore, we have the following 
equation at grid point x m , 



(fe+1) _ "ROm (fc+1) fl _ Wfc) ,^R, W , \ 
"m . u m+l V 1 0J R) a m ' i \ c rn ",„-l T Jm)l 

while the following equation should be used at grid point cc m +i, 



(fe+i) 

1 m+l 



^£Cm+l (fc+1) 

h Um 
Om+1 



(1 - + 



^m+1 



( B in+l"m+2 + /m+l)- 



(2.6) 



(2.7) 



2.6 and (2.7) can be written in the following matrix form: 



1 



(fc+1) 

(fc+1) 
i m+l 



' m+l 



(2- 



?*m = (1 - Wfi)M^ + ^(c ra W^! + / m ), 



'm+l = (1 - WiJVl + I ( a ™+l«m+2 + /m+l)- 

Om + 1 

(fc) 

The above 2x2 system of equation can be easily inverted to compute u m+1 and 

u m +\ ■ Computing the new values of grid points x m and i£m+i based on the mentioned 
procedure, the computations for the remaining grid points will be performed trivially 
according to the corresponding update formulae. These sequence of computation 
is repeated for each pair of consecutive iterations until the convergence criteria is 
satisfied. The flow chart of this procedure is displayed in figure 2.1. In this plot © 
and symbols are used to denote the application of (2.4) and (2.5) respectively; also 
symbol ® is used to denote the application of (2.8). 

The extension of this algorithm to more than two sub-domains is a straightforward 
job which can be described as follow (cf. figure 2.2): 

(i) Division of the spatial computational domain into desired number of sub- 
domains, based on desired load balancing constraints. 

(ii) Determination of the sweeping direction for each sub-domain. Sweeping direc- 
tion of each sub-domain should be in opposite direction of its neighbors. For 
example we use LR direction for odd sub-domains and RL direction for even 
sub-domains. These sweeping directions are inverted after each iteration. 

(iii) Updating the starting node of each sub-domain with (2.8) and the remained 
nodes with either of (2.4) or (2.5), based on the corresponding sweeping 
direction. If the starting node be located at the physical boundaries, the 
prescribed boundary value will be used there (there is no need to use (2.8) in 
this case). 

4 




i=l 2 3 m-2 m-l m i m+1 m+2 m+3 n-2 n-1 n 



Fig. 2.1. The diagram of the parallel implementation of the presented SOR method for two 
sub-domains. 



iterk+3 O 0- 


-e — © 


® — ©-- 


-© — © 


0- 


-0 — © 


— 0- 


-© — © 


— 0- 


-0 — © 


® — 0-- 


© — © 


iterk+2 © ©- 


-© — © 


— 0- 


-O 


— 0- 


-© — © 


— 0- 


-O — 


0- 


-© — © 


6 0-- 


— 


Itetk+1 ©-- 


-O 


— ©-- 


-© — © 


— 0- 


— © 


— 0- 


-© — © 


0- 


-O — 


© 0-- 


— © 


iter k © ©- - 


-© — © 


— 0- 


-0 — © 


0- 


-e — © 


— 0- 


— © 


© 0- 


-© — © 


O 0-- 


— 


n, 








S! 3 




a, 













Fig. 2.2. The diagram of the parallel implementation of the presented SOR method in one- 
dimension for multiple number of sub-domains. 



Now let's to look at the properties of the presented parallel algorithm and its 
connection to other domain decomposition methods. It is obvious that this algorithm 
is an overlapping domain decomposition method which changes the overlap regions 
area alternatively. As the method has some similarities to overlapping Schwarz algo- 
rithm, lets to mention some differences of our algorithm with the Schwarz method. 
In contrast to the Schwarz method the size of overlap regions here are changes al- 
ternatively. Sub-system of equations are only solved exactly at overlapped regions 
(of course alternatively); and within internal regions of sub-domains, equations are 
solved approximately (by one-pass SOR iteration). In Schwarz algorithm sub-domain 
boundaries are treated as virtual boundary conditions of Dirichlet, Neumann, Robin 
or mixture of these types. When relaxation factors are equal to unity (GaussSei- 
del iterations), our numerical treatment for sub-domain boundaries are equivalent 
to Neumann- type (flux) boundary conditions, in the other cases our internal bound- 
ary conditions are equivalent to a mixture of Dirichlet and Neumann ones. In the 
later case, the relaxation factors u>l and lor plays role of blending factors. In fact 
our method tries to keep the partial coupling between sub-domain during iterations, 
in expense of a negligible computational cost. It is worth mentioning that due to 
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alteration of sweeping directions we expect good smoothing properties of the pre- 
sented method. Later we will observe that how such enforced local coupling leads to 
competitive convergence rate of the presented method in contrast to that of original 
SOR. 

Note that the procedure for higher order stencils or other numerical methods are 
conceptually the same. We avoid further remarks in this regard to save the space. 

3. Parallel SOR sweeping in two-dimensions. In this section we shall ex- 
tend the presented parallel sweeping algorithm for two-dimensional problems. We 
consider the following scaler 2D elliptic PDE: 

—V ■ (a Vit) + f3u — f in € fl, and u = u (x) on T (3-1) 

where € M 2 , T = dfl, a : ft — > M. 2x2 is C 1 diagonal matrix with diagonal entries 
dx,Ciy & /3 € K + U {0}. For the sake of convenience, it is assumed that £1 = 
[0, La,] x [0,-Ly]. Same as previous, sufficient regularities are assumed for the field 
variables. 

Suppose that fi is divided into (n + 2) and (m + 2) Cartesian grids along x and 
y directions respectively. The grid points tjj) are given by Xi = a^-i + &Ej_i, i = 
l,...,n and y., = 2/2-1 + 8yj-i, j = l,...,m; where x = y = 0, x n+1 = L x , 
2/m+i = L y , Sxi = x.i + i — Xi and Syj = yj+\ — yj. Same as previous, the finite 
difference approximation of (3.1) is used here to describe the presented method (with 
second order central differencing scheme). Higher order finite difference methods as 
well as other numerical methods (like FVM, FEM) can also be employed. The only 
requirement of our method is discretization of PDE on a structured grid. We use 
Wij to denote the finite difference approximations of w(xi,yj). The approximation of 
(3.1) with the mentioned scheme has the following form, 

-afjUij-i - OijUi-ij + afjUij - afju i+ld - a^Uij+i = f itj , (3.2) 

for i = 1, . . . , n and j = 1, . . . , m; where af } = afj + a™ + af } + af^ + /3, 

E _ % a xi+l/2,j w _ % a xi-l/2,j 



i ' 3 Sxi(Sxi + 5xi-i) l,J 5xi-i(5xi + Sxi-i)' 



N _ 2a »'J+l/2 S _. 2aj /i) j_i/ 2 



a ' J <%(<% + <%-i)' *' 3 Syj-iifyj + <%-i)' 
and the mid grid point coefficient are computed by harmonic averaging, e.g. 

^^xi,j^xi-\-l,j _ ^CXyi,j^yi,j + l 
a xi+l/2,j — — ) a yi,j + l/2 = ~ • 

Using the corresponding boundary conditions, 

u ,j = uo(Q,yj), u n+1J = u (x n+1 ,yj), j = l,...,m. 



Ui,o = u (xi,0), Ui im +i = u (xi,y m +i), i = l,...,n. 

Same as previous, it is possible to write above equations in a matrix form. However, 
the presented method is matrix-free and there is no need to write equations explicitly 
in a matrix form at this point. 
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There are variety of sweeping direction to solve (3.2) with SOR method. Consid- 
ering a rectangular computational domain, it is possible to start sweeping procedure 
from one corner toward its corresponding opposite corner. Therefore, there is four 
sweep directions. Using geographic directions (east, west, north, south), it is possible 
to show these directions by SW-to-NE, NE-to-SW, SE-to-NW and NW-to-SE. There 
are various sweeping strategy for each of mentioned directions. These strategies are 
shown in figure 3.1 for SW-to-NE, NE-to-SW directions on a 3 x 3 structured grid. 
Every sweeping strategy should preserve causality; should be performed along a causal 
direction. The causality here means that order of update should be such that new 
values of at least one half of neighboring nodes of the current grid should be known 
prior to its update. It is possible to use distinct relaxation factors along each sweeping- 
direction. We show relaxation parameter of each direction by subscript according to 
its starting corner, i.e., wsw, w^e, wse, wmw- 




(f) (g) (h) 




(a) (b) (c) (d) 



Fig. 3.1. Possible sweeping strategies for SW-to-NE (bottom row), NE-to-SW (top row) direc- 
tions on a 3 X 3 structured grid: order of updating procedure in natural row-wise (a, e), symmetric 
row-wise (b, f) and frontal (c, d, g, h) sweeping strategy. 

The SOR updating (from iteration k to k + 1) formulae for grid point has 
one of the following formulae, 

(fc+1) _ n x (fc) U SW , s „(fc+l) i W,(fc+1) , R „(fc) , n N „(*) , f \ 

u s „■ - (1 - ujsw)u id H p-\ a i,j u i,j-i + a i j u l _ l j + a id u i+l j + a Lj u i j+1 + fa) 



«£ +1) = (1 - -ne)^ + ^(af^-i + °>Z^\i + + «&«SS } + hi) 



(3-3) 
(3.4) 



(1 - useMJ + ^^( a ij u i,j-i + a i,j u i-i,j + a ij u \+ij + a i,j u ij+i + hi) 

(3.5) 

/ -i x (fe) . ^NW , S W , W + 1) i E (fc) N (fc+l) , f \ 

a i,j 

(3.6) 



It is clear that due to required causality condition, the above procedures are 
sequential in nature. To parallelize this sequential procedure, we first split the com- 
putational domain into p (almost) equally-size non-overlapping sub-domains (based 
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on load balancing constraints). To describe our algorithm, we consider a Cartesian 
decomposition in this section, i.e., p — p x x p y where p x along p y are number of 
sub-domains along x and y spatial axes respectively. For the sake of convenience, it 
is assumed that Q is decomposed into a 2 x 2 array of sub-domains (cf. figure 3.2). 



o — o — o- 

T — T — T 

o — 



o — o — o 

V — T — T 

o — — 



o — o — o- 



6 — 6 — a 



o — o — 



6 — 6 — 6 



Fig. 3.2. Decomposition of a 6 X 6 Cartesian mesh into 2x2 array of sub-domains. 



As it is implied in previous, a key to achieve a parallel SOR with a little conver- 
gence rate decay is to minimize degree of decoupling due to parallelizing, as much as 
possible. In the original SOR when a point is updated the remaining points sense its 
new value due to causality of sweeping strategy. 




Fig. 3.3. Decomposition of a 6 X 6 Cartesian mesh into 2x2 array of sub-domains: sweeping 
directions and order of update are shown during two consecutive iterations, coupling between nodes 
are shown by shading. 



To keep the causality of sweeping partially, a multi-frontal strategy is employed 
here for the purpose of parallelization. Consider the mentioned 2x2 Cartesian split- 
ting of the computational domain. According to figure 3.3, we use frontal sweeping 
along directions, NE, NW, SE and SW for sub-domains 1 though 4 respectively. The 
order of updating is shown in plots. It is clear that to update node #1 of each sub- 
domain, we need two unknown values (new values) from two neighbor sub-domains. 
A shaded region in figure show this coupling in iteration fe-th. Therefore a local 4x4 
system of equation should be solved for this purpose. Assuming node #1 of f^i (at 
iteration fc-th, left side of figure) is denoted by and using global index, this local 
system of equation has the following form (application of (3.4), (3.6), (3.5) and (3.3) 
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for S7i though SI4 respectively), 



u SW», + i J + i 



a f+l,j 

a - - 1 t 



1/ 



(fc+1) 
%3 

(fe+1) 

(fe+1) 
^,3 + 1 

,(fe+l) 
i+1,3 + 1 



/ s ( fc ) 

(1 - wne)uW + " NE{a ^~ l^Ul±M 
U - wjvwju i+1 ■ h -p 

(-I W (fe) I "SE(a5 + i"'-i.j + i+ a "j + i"' fc J ) +2+/i,j + i) 

(1 USE)U i>j+1 + fl p — 

/-I ^Sw(a i + i,j + iM i + 2 ,j + l+ai + l,j + l"i + l,j + 2+/i+l,J + i; 

(1 "Sff T p 



W Cfc) 



(3.7) 



This system of equation is non-singular and can be easily inverted. After update 
of the corner node (node 1), we should update nodes on (virtual) boundaries of sub- 
domains. As it is shown in figure 3.3), node #2 of each sub-domain is coupled to 
node #2 of a neighbor sub-domain. Assuming node #1 of f^i is denoted by 
and using global index, using (3.4) and (3.5)the following 2x2 system of equations 
should be solved to compute new values of node $=2 in Qi and f^, in fact (i — l,j) 
and (i — l,j + 1) (cf. 3.3), 



-1,3 



a. . . . , 

» — 1:3 + 1 



It 



it 



(fe+1) 

»— 1)3 

(fe+1) 
i-1,3+1 



I \ I ffe + 1) \ 

( 1 , , \„(k) , ^JVfiK-l,j"i-l.j-l+ a i-l,j"i-2,j+ a i-l,j"i,j ) 

(1 - L}TfE)U i _ li j H ^p-j- 

,, , / W (fc) 1 B (k + 1) 1 JV CO \ 

(1 "SE(°i-l,j + l"i-2,j + l+ a i-l,j + l"i,j + l +°i-l, J - + l"i-l,j + 2) 

WSE)U>i-l t j T a p 



(3.8) 



Note that the new values of it. 



(fc+i) 



and i4^+i are located in right hand side of (3.8) as 



they are known after solution of (3.7). In the same manner other remaining boundary 
nodes will be updated. Then, the internal nodes will be updated using a classic SOR 
method along the corresponding sweeping direction. 

The next sweep (iteration (fc + l)-th) for the mentioned decomposition will be 
performed along the corresponding opposite directions, i.e., SW, SE, NW and NE 
for sub-domains 1 though 4 respectively (cf. figure 3.3). In this specific example there 
is no need to solution some local system of equations for this sweep; but in general one 
may usually needs to solve such systems at coupling points of sub-domains which are 
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determined based on sweeping directions. Figure 3.4 shows the alteration of sweeping 
directions during each four consecutive iterations for a 4 x 4 array of sub-domains. In 
the next following sub-sections some implementation issues of our algorithm will be 
discussed in details. 
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Fig. 3.4. Alternation of sweeping directions during 4 consecutive iterations for 4x4 array of 
sub-domains. 



As a final remark in this section it should be mentioned that the extension of our 
algorithm for nine-point or higher order central stencils is straightforward. As the ex- 
tend of computational molecule is increased in a higher order method, the size of local 
systems to be solved will be also increased accordingly. Following the same strategy, 
it would be possible to apply the presented method for other numerical methods, e.g., 
FVM and FEM methods, on a structured grid. Note that every structured grid has 
a one-to-one correspondence with with a Cartesian grid. Therefore, the presented 
method can be directly used for non-Cartesian structured grids too. To save the 
space, further details in this regards are ignored in this study. 

3.0.1. Domain decomposition. The first aspect of our parallel algorithm is 
to divide the computational grid into parts. This is how the full computational task 
is divided among the various processors. Each processor works on a portion of the 
grid and anytime a processor needs information from another one a message will 
be passed (in message passing protocol; no message passing is needed regarding to 
shared- memory machines) . 

To achieve a good parallel performance, we like to have an optimal load balancing 
and the least communication between processors. Consider the load balancing first. 
Assuming a homogeneous parallel machine, one would like that each processor does 
the same amount of work during every iteration. That way, each processor is always 
working and not sitting idle. It makes sense to partition the grid such that each 
processor gets an equal (or nearly equal) number of nodes to work on. The second 
criterion, needs to minimize the number of edge cuts and the degree of adjacency for 
each sub-domain (number of neighbors processor) . So the domain decomposition could 
be converted to a constrained optimization problem in which the communication cost 
to be minimized under the load balancing constraint. In the case of regular Cartesian 
grid, this optimization problem can be easily solved within a negligible cost. 

The cost of communication is composed from two types of elapsed time, one pro- 
portional to the communicated data size (function of network communication speed) 
and one due to the network latency which is independent from the data size. There- 
fore, dealing with a high latency networks, we may prefer to decrease the degree of 
adjacency in our communication graph; in expense of a higher size of communicated 
data size. On the other hand, dealing with low latency networks, we may neglect the 
effect of network latency. It is worth mentioning that, using an overlapped communi- 
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cation and computation strategy, it will be possible to decrease cost of communication. 

The above discussion is correct in general for every parallel algorithms. How- 
ever, one should consider other criteria when the convergence rate of the parallel 
algorithm is a function domain decomposition topology (number of sub-domains and 
their geometries). 

In the present study increasing number of sub-domains along each spatial direction 
is equivalent to decreasing the degree of coupling in that direction. Rough speaking, 
it usually decreases the convergence rate. However, this conclusion is not correct 
in general. In some cases, a limited degree of decoupling not only decreases the 
convergence rate but also improves it. As an example consider the solution of Poisson 
equation in a square domain. Also, assume that the center of domain is the center of 
symmetry for the exact solution. Now, let's to decompose the domain into 2x2 equal 
sub-domains. Due to mentioned symmetry, the parallel algorithm performs superior 
than the original SOR method (as it is more consistent to mentioned symmetry in 
actual solution). This example implies that decoupling does not always decrease the 
convergence rate. Therefore, one may expect that an algorithm with partial coupling 
(like that of ours) performs even superior than the original SOR in practice. 

3.0.2. Sweeping directions. After the domain decomposition, it is essential 
to determine the sweeping directions for each sub-domain during 4 consecutive iter- 
ations. For this purpose it is sufficient to use the following rules (lie ID version of 
algorithm) : 

- Along each spatial direction the sweeping direction of a sub-domain should 
be in opposite direction of its neighbor sub-domains (cf. figure 3.4). 

- The sweeping directions (along all spatial directions) should be reversed dur- 
ing every pair of iterations (cf. figure 3.4). 

- After each pair of iteration, the sweeping will be performed along a new 
diagonal, e.g., after two consecutive iterations along diagonal SW-NE, the 
direction of sweeping for the next two iteration will be changed to SE-NW 
diagonal (cf. figure 3.4). 

Using the above mentioned rules, the sweeping directions are generated automatically 
without any complexity, e.g., figure 3.4 is produced applying the above mentioned 
rules within a nested loop using a few lines of Tikz 1 scripts. 

3.0.3. Data structure. In the sequential SOR a two-dimensional array is suf- 
ficient to store the global data. This needs a nested loop to sweep the hole domain. 
Since in the presented parallel SOR method, we do not have a regular order of up- 
dating for sub-domains boundary nodes, it is preferable to store order of update for 
these nodes. To achieve better performance every data (mentioned indexes and field 
variables) are stored in a one-dimensional array (sometimes called as vectorization). 
In fact in this way we have a one-to-one mapping from 2D indexing to ID indexing, 
e.g., node is stored at location i + j * (n + 2) in the one-dimensional array. In 
this way, the location of nodes (i + 1, j) and (i, j + will be the location of node 
plus I and (n + 2) values respectively. 

To update nodes which are located at boundaries between neighbor sub-domains, 
the data from the corresponding neighbor sub-domains is required. Therefore, we 
add two rows/columns of halo points around each sub-domain which are renewed (via 



www.texample.net / tikz / 
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communication) prior to a new iteration (in the case of higher order schemes, the 
width of halo region should be increased accordingly) . In the present implementation 
every sub-domain has at most 8 neighbors, i.e., 4 first-degree neighbors (contacted 
via an edge), 4 second-degree neighbors (contacted via a corner). Figure 3.5 shows 
schematically the communicated data during SW to NE and NE to SW sweeps for 
a sub-domain with 8 active neighbors. In this figure, yellow-shaded regions contain 
native nodes and gray-shaded regions includes required data which should be com- 
municated. It is clear that it is possible to overlap a portion of communications with 
computations to improve the performance. 



SW-to-NE NE-to-SW 

Fig. 3.5. Schematic of data communication during frontal sweeping SW to NE (left) and NE 
to SW (right) for a sub-domain with 8 active neighbors: yellow shaded regions contain native nodes 
and gray shaded regions include communicated nodes. 



4. Extension to three-dimensions. Following the geometric procedure dis- 
cussed in the previous section, it is straightforward to extend the presented method 
to three-dimensional (3D) cases. The main ingredients of such extensions are briefly 
addressed here. 

In 3D case, there are eight alternatives frontal sweeping directions. Adding front 
and back directions to mentioned geographic directions (east, west, north, south), 
these eight sweep directions are as follows: BSW-to-FNE, FNE-to-BSW, BNE-to- 
FSW, FSW-to-BNE, BNW-to-FSE, FSE-to-BNW, BSE-to-FNW and FNW- 
to-BSE. There are eight relaxation parameters accordingly, uibsw, ^fne, ^>bne, 
Wfsw, wbnWj ufse, ubse and uifnw- After the domain decomposition, the sweep- 
ing directions are determined for each domain for each eight consecutive iterations 
using mentioned rules in 3.0.2. Using a Cartesian decomposition topology, each sub- 
domain will has at most 26 neighbors, i.e., 6 first-degree neighbors (contacted via a 
face), 12 second-degree neighbors (contacted via an edge), 8 third-degree neighbors 
(contacted via a corner). 

Same as 2D version of algorithm, some local system of equations are needed to 
be solved at sub-domains coupling nodes. Now, consider a sub-domain with 26 active 
neighbor. At starting corner, a 8 x 8 system of linear equations should be solved. Note 
that this system is not dense (includes 32 none-zero entries) . Then, along each of the 
three coupling edges, a sort of 4 x 4 local system of linear equations should be solved 
in an appropriate order. In the same manner, on each of the three corresponding 
coupling faces a sort of 2 x 2 local systems of equations should be solved based on an 
appropriate update order. Updating coupling nodes, the remaining nodes are updated 
using classic SOR method along the corresponding sweep direction. 
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5. Parallel incomplete LU preconditioners. The SOR method does not usu- 
ally used as an iterative solver in practice. But it is commonly used as a smoother 
in multi-grid methods or as a preconditioner in Krylov subspace methods. In these 
cases, a few iterations of SOR method is applied during each step of precondition- 
ing (or smoothing). Another kind of preconditioners which are extensively used in 
Krylov subspace methods are incomplete LU (ILU) preconditioners. However due to 
sequential nature of incomplete factorization and also forward/backward subsituation 
procedure, application of ILU methods in parallel is challenge. 

In this section, we show that the presented parallel method can be equivalently 
used to develop parallel ILU preconditioners. A simple observation shows that single 
pass of a symmetric Gauss-Seidel method on a n-diagonal matrix (precisely a matrix 
includes only a few non-zero diagonals with symmetric sparsity pattern) is equivalent 
to ILU(O) preconditioning (cf. section 10.3 of [21]). Therefore, the presented parallel 
algorithm can be used as a parallel ILU(O) preconditioner as well. Similarly, ILU(p) 
preconditioning for such structured matrices is equivalent to application of our method 
for a higher order method (cf. section 10.3 of [21]) which make sense to use our method 
to parallelize ILU(p) preconditioners as well. 

Later, we will show that our method can be applied for n-diagonal matrices raised 
from discretization of PDEs on structured grids. Therefore, the presented strategy 
can be also regarded as a parallel ILU(p) preconditioner for such n-diagonal matrices. 

6. Convergence analysis. In this section we first proof the convergence of the 
presented parallel algorithm in one- and two- spatial dimensions. Following the same 
line of proof, the extension of theory to higher dimension will be straightforward. 

6.1. Mono-dimensional convergence analysis. To simplify the analysis, it is 
worth to write the presented method in an algebraic form. Without loss of generality, 
assume that the spatial domain is decomposed into p number of sub-domains where 
p is an even integer and all sub-domains include I number of grid points (note that 
boundary nodes are not considered here, i.e., n = pi). Also, assume that the sweeping 
direction of the first sub-domain is LR at the starting iteration. We can decompose 
matrix A in (2.3) as follows, 



A = G L + pI + G R , 
where I denotes an n x n identity matrix here, 



(6.1) 



Li 



Ri 



Lp-i 



Rh 



, Gr — 



Ri 



Rp-i 



L 



p J 



for q = 2,3,... ,p, 



L q = 



~ c (q-l)l + l C(q-l)l + l 

~ C (q-l)l+2 c (g-l)/+l 
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-C q l-1 C q l-l 

-Cql Cgl 



lx(l+l) 



for q = l,2,...,p-l, 



a (q-l)l+l - a (q-l)l + l 



-ffl(?-l)i+2 



O-ql-1 



Zx(i+1) 



Li = 



-ca 



Q-i 



Ori-2+2 



Note that matrices L q and i? 9 should be assembled in Gl and Gr such that their 
positive diagonal match to the main diagonal of Gl and Gr. Assume that eigen- 
values of matrices Gl and Gr are denoted by vectors A L = [\[, X l 2 , ■ ■ ■ , X l n ] T and 
Ar = [AJ, Aj, . . . , A^] T respectively. The following Lemma determines A L and A R 
explicitly. 



Lemma 6.1. Gl and Gr, are strictly positive definite and we have, 



Al — [Cl, . . . , Cj, az+l, . . . , d2i, . . • , C(p_ 2 )Z + l! • ■ ■ , C(p_x)J, fltn-J+l, • 



Ar — [ai, . . . , a;, cj+i, . . . , c 2 ;, . . . , a( p _ 2 );+i, . . . , a( p _ 1 ) i , c„_/ + i, . . . , c„] 

Proof. Considering the structures oi Gl and G/j, it is easy to verify that Gl and 
Gr are block lower and upper triangular matrices. Note that the definition of both 
of the block lower triangular and upper triangular matrix hold for each of Gl and 
Gr. Therefore, their corresponding eigenvalues are equal to union of their diagonal 
blocks's eigenvalues. The Diagonal blocks have a simple bi-diagonal structure and 
their eigenvalues are equal to their main diagonals. Considering the fact that that 
Oj, Ci > 0, i = 1, . . . , n completes the proof. □ 

Theorem 6.2. The presented parallel one- dimensional algorithm is convergent 
if\l-u L \ \1-wr\ < 1. 

Proof. Considering the decomposition (6.1) of matrix A, with simple algebra, we 
can write the presented one-dimensional parallel iterative algorithm in the following 
matrix form (k = 0, 2, • • • ), 

((b - lu l A l )I + ojlGl) u( fc+1 ) = (((1 - u L )h + u L A R )I - u L G R ) u« + u L f 

((b - lu r Ar)I + urGr) u( fe + 2 ) = (((1 - Lu R )b + u R A L )I - lorGl) u^ +1 ) + w a f , 

(6-2) 

where b = [bi, &2> • • • , b n ] T . From (6.2) we have, 

u (fc+2) = Tu (fe) + f = 0, 2, • • • . (6.3) 
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in which, 



T = ((b - u R A R )I + u R G R ) (((1 - w R )b + uj r A l )I - u R G Lj 

x((b-w L A L )/ + w L G L ) l (((l-u L )b + u L A R )I-u L G R ^ (6.4) 

To proof the convergence of iterations (6.3) it is remained to show that the spectral 
radius of iteration matrix T which is denoted by p(T) here is strictly less than unity 
(cf. [21]). Now, let's to define the matrix T which is equivalent to T (has the same 
spectral radius), 

(6.5) 



(6.6) 



f = ((b - w R A R )I + uj r G r j T ((b - to R A R )I + uj r G r 
Using (6.4), T can be written as follows, 

f = (((1 - u R )b + lo r A l )I - u R G L ) ((b - u L A L )I + u L G L ) 
x (((1 - uj L )b + uj l A r )I - uolGr^j ((b - u> R A R )I + w fl G fl ) 
Let's to define the following matrices 

G L - = ((1 - u R )b + u R A L )I - uj r G l , G L+ = (b - lo l A l )I + lu l G l 

Gr- = ((1 - oj L )b + oj l A r )I - uj l G r , G B + = (b - u R A R )I + w R G R 

Assume eigenvalues of Gl-, Gl+, G R - and G R + are denoted by the following vectors 
respectively, 



A, 



[A t , . . . , A n 



;-it 



A 



L+ 



[\[ + ,...,x l n + ] T , 



A R - — [X[ , • ■ • , Xn ] T , A R+ — [X[ + , . . . , X r n + ] T , 
It is easy to show that, 

A L - = (1 - Wfl)b, A L+ =b, A R _ = (1 - oj L )b, A R+ =b. 
Using the knowledge of linear algebra we have, 

p(T) = p(f) = || f || sc || (G £ _) (G-|) || x || (G fl _) (G-i ) || 
where the norm operator || • || is understood as the Euclidean norm here, 



|| {G L -) (G2|) I) = max 



/ + 



max 



(1 - 



|| (G R -) (G r \) II = max 



AT 



Therefore, if, 



max 



(1 - uj L )bj 



|l-Wi| |l-w*| < 1. 
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p{T) < 1 which complete the proof. □ 



Corollary 6.3. The presented one- dimensional parallel Gauss-Seidel algorithm 
(ljl = ujr = 1) is unconditionally convergent. 

Corollary 6.4. When ujl = = u) the convergence criterion of the presented 
one- dimensional parallel SOR algorithm is similar to that of sequential SOR method, 
i.e., < u) < 2. Moreover uj opt — 1 in this case. 

Remark 6.5. Using the same procedure, it will easy to show that the coefficient 
matrix A corresponding to a high order methods (includes more extended computa- 
tional molecules) admits a decomposition like to (6.1), where Gl and Gr are two 
strictly positive definite block lower and upper triangular matrices respectively. Then, 
it will be straightforward to proof the convergence similarly. 



6.2. Two-dimensional convergence analysis. Following the previous sec- 
tion, it is possible to proof the convergence of the presented method in two spatial 
dimensions. To illustrate this issue, the convergence of the presented two-dimensional 
algorithm for discrtized equation (3.2) will be discussed in this section. 

Extension of mono-dimensional proof to higher dimensions is based on tensor 
product properties of the structured grids. The standard five-point discretization of 
(3.1) problem on an n x m structured grid (ignoring Dirichlct boundary nodes) leads 
to the following system of linear equation (the effects of Dirichlet boundary nodes are 
included in the right hand side vector), 



Au = f 



where strictly positive definite nm x nm dimensional matrix A has the following 
structure (row- wise ordering is assumed for grid points), 



.4 



B 1 A 1 

Ci B2 

c 3 



-4, 



a 



A m -i 
B,, 



m 



where each of Bi, Ai and Q is an n x n multi-diagonal (three or mono-diagonal) 
matrices. For j = 2, . . . , m — 1, 



r w p e 
'' 1 j " 1 • . a ij 

n W 



a n-l,j 
W P E 

a n,j a n,j a n.j 
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nx (ri+2) 



Bi 



"1,1 

n w 
a 2.1 



"11 
a 3,l 



a 2,l 



W 
l n,l 



l n-l,l 
a n.l 



l n,l 



nx (n+1) 



a l,m 



and Aj = [diag(— a 



N 

hp' 



"l,m 
a 2,m 



n :J 'Jfl> 



*n— l.m 



n E 
n P 



[diag(- 



nx (n+l) 
S •)! • 



Without loss of generality, assume that the spatial domain is decomposed into 
p x x p y number of sub-domains, where p x and p y are even and every sub-domain 
includes l x x l y number of grid points (n = p x l x and m — p y l y ). Also, assume that 
the sweeping direction of the first sub-domain is NE-to-SW at the starting iteration. 
It is easy to show that the coefficient matrix A admits the following decomposition 
(notice that the natural row- wise ordering is considered here), 



A = G 1 + /3I + G 2 , 
where / denotes an n x m identity matrix here, 



(6.7) 



Si 



Gi 



G 2 



J Py 



for q = 1,. .. ,p y - 1, 



N„ 



H( q -l)l y + l P(q-l)l y + l 

H(q-l)l v +2 P(q-l)l y +2 



for q = 2, . .. ,p y , 



S q = 



Q(q-l)l v + l V{q-l)l y +l 

Q(q-l)l v +2 V {q _ 1)ly+2 



Q 



No 



S Pv -i 



N 



iql v —l 
Hqly Pql y _ 



lyX(l y +l) 



qly-1 



Qqly Vqly J l y x(l v + l) 
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N n „. = 



H(Py-l)ly + l P( P y-l)l y +l 



H Py i y -i P p 



H. 



Pyly 



Si 



v 1 

Q2 v 2 



Qk v t 



y v y J 



where P, = A,- (for j = 1, . . . , m— 1), Qj = Cj (for j = 2, . . . , to) and for j 



Hi 



NE 



hi 



NW- 



2,j 



S3 



2,3 



sw p .- ld 



SE. 



P x ,3 



denoting aff = af } + af_ rj , for r = 1, . . . ,p x — 1, j = 1, . . . to, 



a (r-l)( x + lj - a ( r -l)/ x + l,j 



"(r-l)/ x +2j "(r-l)/ a: +2j 



JVE _ E 

a rl x -l,j a rl x -l, 



,NE 



l rl x ,j 



denoting a?™ = a% + af } , for r = 2, . . . , p x , j = 1, . . . to, 



a (r-l)l«+l,j a (r-l)J x +l,j 



-a (r-l)«*+2,j °(r-l); x +2J 



u rl x -l,j u rl x -l 



-a 



w 



18 



denoting a 



sw _ n w 



,j , afj, for r = 2, ...,p x - 1, j = l,...m, 



SW rJ 



w SW 
- a (r-l)l x + l,j a (r-l)l x + l,j 



W SW 
\r-l)l x +2,j a (r-\)l x +2,j 



,,W n SW 
J rl x -l,j a rl x -l,j 



W SW 
a rl x ,j a rl x ,j 



l x x(l x +l) 



denoting aff = af 3 + afj, for r = 2, . . . ,p x - 1, j = 1, . . . m, 



-.SE £/ 

X (r-l)'x + lj V-l^+lj 



l fr--l)/ a: +2j a (^-l); a: +2j 



~.SE n E 

l rl x -l,j U rl x -l,j 

,.SE 
a rl x ,j 



l rl x ,j 



Considering effects of boundaries, 



r n sw 



w n sw 

a 2,J a 2,j 



sw hj = 



_ w sw 

l X lyj ° X ±ij 



_ W SW 
a l*,3 a l x ,3 



;-Xl X 



ixx(; x +i) 



r „se 



SE. 



P x ,3 



l n-l x +l,j 



-a 



E 

(n-l x + l,j 



,SE 

1 n -i x +2j 



l n-l x +2,j 



n SE — n E 
"n-lj n-l,j 



-,SE 



l x xl x 



Note that matrices N q and S q should be assembled in G\ and G 2 such that Hj and 
Vj entries are matched on the corresponding main diagonals, in the same manner, 
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matrices NE r j, NW r j, SE r j and SW r j should be assembled in Hj and Vj such that 
their positive diagonals are matched on the corresponding main diagonals. 

Using the same approach, parallel sweeping along the other diagonal, NW-to-SE, 
implies the following decomposition, 

A = G 3 + I3I + G 4 , (6.8) 

Since there is a clear analogy between decompositions (6.7) and (6.8), we avoid fur- 
ther details about G3 and G4 too save the space. In fact, during a NE-to-SW sweep, 
matrices G\ and G2 respectively include the implicit and explicit parts of the compu- 
tational molecule related to the discrtized diffusion operator. During a SW-to-NE 
sweep, the position of implicit and explicit parts will be reversed. 

To simplify the proof of convergence, we define some helpful matrices and mention 
their elegant properties here. 

Definition 6.6. Multilevel matrix splitting: Consider n x n matrix A and in- 
teger set S = {mi, m.2, ■ ■ ■ , m s } such that 1 ^ s ^ n, ^ 1 (i = 1, . . . , s) and 
Si=i m i = n - The multilevel splitting of matrix A under set S which is denoted by 
MLSs.(A) is defined as set {Ai, A2, . . . , A s } such that every matrix Ai is composed 
from the first mi row and columns of matrix Ai—\ (i = 1, . . . , s), where Aq = A. 

Definition 6.7. Alternatively lower-upper triangular matrix: An n x n matrix 
A is called alternatively lower-upper triangular, if there is a multilevel splitting set 
H = {mi, . . . , m s } such that every Ai € MLSs.{A) is either a lower triangular matrix 
or an upper triangular matrix. 

Definition 6.8. Alternatively block lower-upper triangular matrix: An n x n 
matrix A is called alternatively block lower-upper triangular, if there is a multilevel 
splitting set S = {mi, . . . , m s } such that every Ai € MLSs{A) is either a block lower 
triangular matrix or a block upper triangular matrix. 

Corollary 6.9. Assume that n x n matrix A is an alternatively lower-upper 
triangular matrix, then det(A) — Jl"=i a a an d the set eigenvalues of matrix A are 
equal to set of diagonal entries. 

Proof. The proof is evidently followed considering the definition of the matrix 
determinant. □ 

Corollary 6.10. Assume that n x n matrix A is an alternatively block lower- 
upper triangular matrix, then det(A) is equal to the product of determinant of its 
diagonal blocks and the set eigenvalues of matrix A are equal to union of set of eigen- 
values corresponding to the diagonal blocks. 

Lemma 6.11. Assume that nx n matrix A is an alternatively lower-upper trian- 
gular matrix, then there is a set of permutations with cardinality m ^ n that maps A 
to either a lower triangular or an upper triangular matrix. 

Lemma 6.12. Assume that n x n matrix A is an alternatively block lower-upper 
triangular matrix, then there is a set of permutations with cardinality m ^ n that 
maps A to either a lower block triangular an upper triangular triangular matrix. 
Since Lemma 6.11 and 6.12 do not play any role in the course of our analysis, we leave 
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their proof to interested readers. 



COROLLARY 6.13. Matrices G\, G%, G3 and G4 in (6.7) and (6.8) are alterna- 
tively block lower-upper triangular. 

Proof. The proof is evident, considering the structures of G\, G2, G3 and G4 
illustrated in this section. □ 

It is worth mentioning that the same result is hold for sub-matrices Hj and Vj which 
is summarized in the following corollary. 

COROLLARY 6.14. Matrices Hj and Vj (j — l,...,m), are alternatively block 
lower-upper triangular. 




global ordering 1 global ordering 2 



Fig. 6.1. Decomposition of a 6 X 6 Cartesian mesh into 2x2 array of sub-domains: sweeping 
directions and global order of updating for natural row-wise ordering (left) and parallel sweeping 
ordering (right). 

Lets to visualize results of Lemma 6.12 and Corollary 6.13 through a simple 
example. For this purpose we use 2x2 decomposition of the 6x6 spatial grid shown 
in left side of figure 3.3. To show the global sparsity pattern, we should first convert 
local indexing into a global one. We use two type of global ordering here. The first 
one based on natural row-wise ordering (cf. left hand side of figure 6.1). The second 
one is globalization of update order according to our algorithm. Assuming local 
updating index of a node #n is in Qj is denoted by ii, its global index is computed 
as i g = 4 * (ii — 1) + j for coupling nodes and the remaining nodes are indexed based 
on priority of their domain ID (cf. right hand side of figure 6.1). The global sparsity 
pattern of A, G\ and G2 for these two case of global ordering is shown in figure 
6.2. The plot clearly confirms the results of corollary 6.13. It is worth mentioning 
that, the global ordering based on the presented parallel sweeping algorithm defines a 
permutation to map the alternatively block lower-upper triangular matrices G\ and 
Gi to their corresponding block lower (upper) triangular (compare upper and lower 
part of figure 6.2). 

Assume that eigenvalues of matrices Gi, G2, G3 and G4 in (6.7) and (6.8) are 
denoted by the following vectors vectors respectively 

Ai = [A}, A2, . . . , A*] T , A 2 = [A? , A§, • ■ • , A^] T , 
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A Gi G 2 

Fig. 6.2. Decomposition 0/06x6 Cartesian mesh into 2x2 array of sub-domains: global 
sparsity pattern of A, G± and G2 for global ordering based on natural row-wise ordering (up) and 
parallel sweeping ordering (bottom); cf. figure 6.1. 

^3 = [A?, A§, . . . , A^] T , A 4 = [Af, A|, . . . , \ n ] T ■ 
The following Lemma computes these values explicitly. 

Lemma 6.15. The following relation holds for sets of eigenvalues of G\, Gi, G3 
and G4 in (6.7) and (6.8), 

Ai, A 2 , A 3 , A 4 c{</, aff, aff , aff \ i = 1, . . . ,n; j = 1, . . . ,m }. 

Proof. Considering Corollary 6.13, the eigenvalues of Gi, G 2 , G 3 and G4 are 
equal to their diagonal entries. The diagonal entries are equal to diagonals of Hi and 
Vj for j = 1, . . . , m which complete the proof. □ 

Corollary 6.16. Matrices G\, G2, G3 and G4 in (6.7) and (6.8) are strictly 
positive definite. 

Theorem 6.17. The presented parallel two-dimensional algorithm is convergent 
if the following condition holds, 

|1 - usw\ |1 - u S e\ |1 - unw\ |1 - u NE \ < 1. 
Proof. Let's to define the following temporary notations to concise expressions, 

def def def def 

U)\ = UJNE, W 2 = LOSW, w 3 = ^>NW, w 4 = USE- 
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Inspiring from the ID analysis, we consider every four iterations of the presented 
method as a group of iterations (sweeping along NE-to-SW, SW-to-NE, NW-to- 
SE and SE-to-NW respectively). Considering the decomposition (6.7) and (6.8) 
of A, we can write the presented two-dimensional parallel iterative algorithm in the 
following matrix form (k = 0, 4, ■ ■ • ), 

((a - wiAi)/ + wiGi) u( fe+1 ) = (((1 - wi)a + wiA 2 )7 - u>iG 2 ) u<» + wif, 
((a - lo 2 K 2 )I + w 2 G 2 ) u( fc + 2 ) = (((1 - u: 2 )a + cj 2 Ai)7 - u^Gij u (k+1) + w 2 f, 
((a- W3 A3)/ + w3G 3 ) u( fc + 3 ) = (((1- W3 )a + W3A4)/-W3G 4 ) u( fe + 2 )+ W 3f, 
((a - w 4 A 4 )/ + a; 4 G 4 ) u( fe + 4 ) = (((1 - w 4 )a + W4A3)/ - a; 4 G 3 ) u( fe + 3 ) + w 4 f , 

(6.9) 

where vector a includes diagonal entries of A with an order consistent to G^. From 
(6.9) we have, 



where, 



u (fc+4) = Tu (fe) + f ( fc = 0, 4, • • • . (6.10) 

T= ((a- W4A4)/ + W4G4) (((1 - uj 4 )sl + W4A3)/ - W4G3) 
x ((a - W3A3)/ + W3G3) f ((1 - w 3 )a + W3A4)/ - W3G4) 
x ^(a — uj 2 A 2 )I + (jj 2 G 2 ^J f ((1 — cj 2 )a + uj 2 A 1 )I — w 2 Gi^ 



x((a-cJiAi)I + cJiGiJ \((l-ui)a + wik2)l -u)iG 2 ) (6.11) 
With some simple manipulations of iteration matrix T, we have, 

f = (Gi_) (G.+y 1 (G 2 _) (G2+)- 1 (G 3 -) (G3+)- 1 (G 4 _) (G 4+ )- x (6.12) 
such that p(T) = p(T), and, 

Gi+ = (a - W1A1)/ + W1G1, Gi_ = ((1 - cj 2 )a + w 2 Ai)7 - cj 2 Gi 

G 2+ = (a - cj 2 A 2 )7 + lu 2 G 2 , G 2 _ = ((1 - wi)a + WiA 2 )7 - cjiG 2 

G 3 + = (a - W3A3)/ + o; 3 G 3 , G 3 _ = ((1 - w 4 )a + w 4 A 3 )/ - w 4 G 3 

G 4+ = (a — W4A4)/ + W4G4, G 4 _ = ((1 — w 3 )a + W3A4)/ — w 3 G 4 

Assume eigenvalues of Gi_, Gi + , G 2 _, G 2+ , G 3 _, G 3+ , G4_ and G4 + are denoted 
by vectors A x _ = {A, 1 -}, A 1+ = {A 4 1+ }, A 2 _ = {A}"}, A 2+ - {A*+}, A 3 _ = {A 3 ~}, 
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A 3 + = {A^ + }, A 4 _ = {A* } and A 4+ = {A^ + } (for i = 1, ... ,mn) respectively, it is 
easy to show that, 

Ai_ = (1 - w 2 )a, A i+ =a, A 2 _ = (1 - wi)a, A 2+ = a, 



A 3 _ = (l-w 4 )a, A 3+ =a, A 4 _ = (1 - w 3 )a, A 4+ = a. 
Therefore, we have (|| • || denotes the Euclidean norm), 

||T || < ||(Gi-) (GX+)- 1 !! ||(G 2 _) (G* 2+ )- 1 || ||(G 3 _) (G3+)- 1 !! ||(G 4 _) (G i+ y l \\ 



max 

l<i<n 




\l-w 2 \ |l-w 4 | |l-w 3 |. 



(6.13) 



Since, p(T) = p(T) = ||T||, using (6.13), the following condition ensures the conver- 
gence of iterations: |1 — |1 — w 2 | 1 1 — CJ3 1 |1 — w 4 | < 1. D 

Corollary 6.18. The presented two-dimensional parallel Gauss-Seidel algo- 
rithm (lone — losw = lonw — lose = 1) is unconditionally convergent. 

Corollary 6.19. When lone — losw = lonw — lose = lo the convergence 
criterion of the presented two-dimensional parallel SOR algorithm is similar to that 
of seguential SOR method, i.e., < lo < 2. Moreover, in this case Lo opt = 1. 



7. Extension to general structured n-diagonal matrices. Using the Carte- 
sian tensor product properties of structured grids and following the same procedure 
mentioned in the previous sections, the extension of our method to three (or higher) 
spatial dimensions is straightforward. Therefore, it is not of value to further discuss 
on this issue. Now, let us to generalize our method a bit more. 

Definition 7.1. (Cartesian grid) The spatial network Z c R d (d ^ 1) is called 
as a Cartesian grid inM. d if it is formed by tensor product of d one- dimensional spatial 
grids. 

Definition 7.2. (Structured grid) The spatial network Z C R d (d > 1) is called 
as a structured grid in M d if there is an orientation preserving deformation in M d 
which maps Z to a Cartesian grid in M. d . 

It is clear that every structured grid in R d has 2 d corner points and so 2 d directed 
diagonal directions (sweeping directions in our algorithm). 

Definition 7.3. (Structured n-diagonal matrix) Consider the following assump- 
tions: Z C K d is structured grid with nii grid points (i — 1, . . . , d) along each spatial 
dimension. Every grid point of Z has an ID number based on a natural row-wise 
ordering. A is a square matrix with cardinality n — IliLi m i- ^ * s an n-diagonal 
matrix with a symmetric sparsity pattern (n is odd). Then we called A as a structured 
n-diagonal matrix if there is a computational molecule M which maps nonzero entries 
of every row i of A to a set of points in the neighborhood of node i in Z ; and vice 
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versa. 

We assume that the reader is aware from the required special treatment at/near 
boundaries of Z. Therefore, these issues are ignore within Definition 7.3. In fact by 
Definition 7.3 we would like to extend application of our method to every matrix A 
which can be connected to a structured grid, e.g., it can be raised from discretization 
of a PDE on a structured grid. 

Now, we want to generalize previous results, i.e., solving (preconditioning) system 
of linear equations Au = f by parallel SOR (or ILU) when A is an irreducible diag- 
onal dominant structured n-diagonal matrix. Let us to briefly mention the outline 
of algorithm in this case. Assume Z C K d is the corresponding structured grid to 
A. We first decompose Z into parts (based on load balancing criteria) and define 
the sweeping directions for each sub-domain during the course of each 2 d iterations. 
Then, order of update for coupling points are determined; note that the coupling 
points are determined based on the current sweeping direction and the computational 
molecule. Finally the iterative procedure is started by updating coupling points in the 
corresponding order and then internal nodes. Now let us to re-state the convergence 
theories for this general case. Although, we do not proof the following results, their 
proof should be straightforward (though can be very massive if one does not find an 
appropriate abstraction in this regard). 

Theorem 7.4. Consider irreducible diagonal dominant structured n-diagonal 
matrix A with the corresponding structured grid Z C R d . The the following results 
hold. 

(i) Matrix admit (2 d )/2 decomposition in the following form such that every de- 
composition correspond to parallel sweeping along an appropriate diagonal of 
Z: A = Gj + D + Gj+i for j = 1, 3, . . . , d — 1, where D is a diagonal matrix 
with nonnegative entries, for i = 1, ...,d, each Gi is strictly positive defi- 
nite alternatively block lower-upper triangular matrix with eigenvalues equal 
to main diagonal. 

(ii) Assume that for i = 1, . . . , d; uii denotes the relaxation parameter along the 
i-th sweeping direction. Then the presented parallel SOR algorithm is conver- 
gent under the following condition: ]X = i |1 ~ w i\ < 1- 

8. Numerical experiment. In this section we present numerical results on the 
convergence and performance of the presented method. As the computing platform 
we used a 32-core mechine based on 16 Dual-Core Intel 2400 Mhz nodes (4GB RAM 
on each node) switched with a Gbit Ethernet network. The MPICH [8] library was 
used for message passing implementation of exchange algorithm. 

The d-dimensional (d = 1,2,3) version of following laplace equation is used here 
as a model problem, 

i—d / q2 \ i—d 

where Xi denotes the i-th component of spatial position vector, e.g., for d = 3, x = 
(#1, x-2, xs) — (x, y, z). The exact solution of (8.1) is u(x) = Yll=i x i- Since this exact 
solution is spatially d-linear, it is possible to recover the exact solution by a spatially 
second order numerical method, independent of the grid resolution. Therefore, the 
discretization errors does can not contribute in our numerical analysis. In all of our 
numerical examples in this study, the Li-norm of error (with respect to the exact 
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solution) is used to study the convergence. Assume that the total number of degree 
of freedoms is n after the discretization of (8.1), the Li-norm of error is computed 
as follows, L\ — error = A SzLi \ Ul ~ ^'1' wnere u t an d Ui denote respectively the 
approximate and exact solutions at the i-th degree of freedom. Moreover, the initial 
guess is taken to be ui = (for I = 1, ...,n) and the iterations is stopped when the 
-Li-norm of error is below threshold l.e — 3; except in our 3D experiments in which 
the convergence threshold l.e — 2 is taken into account. 

Although we do not plan to release a software related to the presented method, 
some (crude) Fortran codes related to model problems used in this study are freely 
available on the following web (mainly to demonstrate the validity of results presented 
in this section): http : //sites. google.com/site/rohtav/home/par. 

8.1. One-dimensional numerical experiment. In our one-dimensional nu- 
merical experiments (8.1) is solved on uniform grids with resolutions, 41, 81 and 161. 

In the first ID example, the model problem is solved by In this left to right 
sweeping Gauss-Seidel (LRGS), right to left sweeping Gauss-Seidel (RLGS), symmet- 
ric sweeping Gauss-Seidel (SGS) and the presented parallel Gauss-Seidel with various 
number of sub-domains (PGS(p), p denotes the number of sub-domains). 

Table 8.1 shows the results of this experiment. It is clear that the presented 
method is convergent independent from the number of sub-domains, and the number 
of iterations to meet the convergence criterion is close to that of the original Gauss- 
Seidel methods. 



Table 8.1 

ID parallel Gauss-Seidel vs. ID sequential Gauss-Seidel. 



method 


41 grids 




81 grids 




161 grids 




iteration 


Ll -error 


iteration 


Li -error 


iteration 


Li-error 


LRGS 


979 


9.94266e-4 


3905 


9.98916e-4 


15598 


9.99738e-4 


RLGS 


960 


9.94266e-4 


3866 


9.98916e-4 


15519 


9.99738e-4 


SGS 


976 


9.96647e-4 


3892 


9.99752e-4 


15565 


9.99977c-4 


PGS(2) 


975 


9.95198e-4 


3891 


9.99297e-4 


15563 


9.99825e-4 


PGS(4) 


974 


9.97789e-4 


3890 


9.99986e-4 


15561 


9.99779e-4 


PGS(6) 


974 


9.94523e-4 


3890 


9.99231e-4 


15559 


9.99769e-4 


PGS(8) 


973 


9.97436e-4 


3890 


9.98472e-4 


15557 


9.99807e-4 


PGS(IO) 


973 


9.94144e-4 


3889 


9.99221e-4 


15555 


9.99809e-4 


PGS(14) 


972 


9.93927e-4 


3888 


9.99208e-4 


15551 


9.99755e-4 


PGS(18) 


970 


9.99382e-4 


3887 


9.99179e-4 


15547 


9.99783e-4 


PGS(24) 






3885 


9.99857e-4 


15541 


9.99814e-4 


PGS(30) 






3884 


9.99459e-4 


15535 


9.99738e-4 


PGS(36) 






3882 


9.99864e-4 


15529 


9.99704e-4 



In the second ID example the model problem is solved by left to right sweep- 
ing SOR (LRSOR), right to left sweeping SOR (RLSOR), symmetric sweeping SOR 
(SSOR) and the presented parallel SOR with various number of sub-domains (PSOR(p), 
p denotes the number of sub-domains) . The optimal values of uj are applied in these 
experiments. These values are primarily computed via numerical experiments (search- 
ing in [1.0, 2.0] with l.e — 3 accuracy). 

Tables 8.2, 8.3 and 8.4 show results of these experiments. According to tables, 
PSOR method is convergent and has a competitive convergence rate with sequen- 
tial SOR methods. Since the initial error had an asymmetric spatial distribution, 
the sweeping directions has considerable effect on the iteration counts. For example 
RLSOR method has the best convergence rate. Note the presented parallel solver 
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overcomes LRSOR and SSOR in some cases. 



Table 8.2 

ID parallel SOR vs. ID sequential SOR for grid resolution 41- 



method 


iteration 


Li -error 


ujl- optimum 


lur- optimum 


LRSOR 


51 


9.99298c-4 


1.86887 




RLSOR 


31 


9.38294c-4 




1.86637 


SSOR 


62 


9.48613c-4 


1.00000 


1.87776 


PSOR(2) 


31 


9.46469e-4 


1.84970 


1.92084 


PSOR(4) 


76 


9.88365e-4 


1.00000 


1.94890 


PSOR(6) 


78 


9.88246e-4 


1.00000 


1.94890 


PSOR(8) 


72 


9.85974e-4 


1.00000 


1.89379 


PSOR(IO) 


71 


9.72100e-4 


1.00000 


1.88577 


PSOR(14) 


87 


9.59733e-4 


1.00000 


1.95591 


PSOR(18) 


71 


9.60855c-4 


1.00000 


1.90080 






Table t 


i.3 






ID narallel SOR 


vs. ID sequential SOR for grid resolution 81. 


method 


iteration 


L\ -error 


u>l- optimum 


ujr- optimum 


LRSOR 


103 


9.82824e-4 


1.93193 




RLSOR 


61 


9.42462e-4 




1.93143 


SSOR 


120 


8.87046e-4 


1.00000 


1.93487 


PSOR(2) 


80 


9.54235e-4 


1.90982 


1.95691 


PSOR(4) 


164 


9.93129e-4 


1.00000 


1.97194 


PSOR(6) 


129 


9.66998e-4 


1.00000 


1.96092 


PSOR(8) 


141 


9.91414c-4 


1.00000 


1.94589 


PSOR(IO) 


140 


9.80610e-4 


1.00000 


1.94088 


PSOR(14) 


129 


9.35418e-4 


1.00000 


1.95992 


PSOR(18) 


138 


9.96443c-4 


1.00000 


1.95090 


PSOR(24) 


141 


9.93632c-4 


1.00000 


1.94790 


PSOR(30) 


180 


9.91055e-4 


1.00000 


1.97094 


PSOR(36) 


143 


9.91698c-4 


1.00000 


1.94389 






Table t 


3.4 






ID parallel SOR 


vs. ID sequential SOR for grid resolution 161. 


method 


iteration 


L\ -error 


td£,- optimum 


ujr- optimum 


LRSOR 


208 


9.81709e-4 


1.96593 




RLSOR 


122 


9.58420e-4 




1.96493 


SSOR 


236 


9.99534e-4 


1.19840 


1.96693 


PSOR(2) 


233 


9.82315e-4 


1.00000 


1.96593 


PSOR(4) 


342 


9.84557c-4 


1.00000 


1.98497 


PSOR(6) 


253 


9.85584e-4 


1.00000 


1.97996 


PSOR(8) 


279 


9.72206e-4 


1.00000 


1.97194 


PSOR(IO) 


277 


9.88102e-4 


1.00000 


1.96994 


PSOR(14) 


280 


9.98979e-4 


1.00000 


1.96994 


PSOR(18) 


269 


9.91725e-4 


1.00000 


1.97495 


PSOR(24) 


277 


9.89229e-4 


1.00000 


1.97395 


PSOR(30) 


278 


9.81003e-4 


1.00000 


1.96894 


PSOR(36) 


281 


9.73789c-4 


1.00000 


1.97194 



Since the optimal values of in SSOR and PSOR solvers are close to unity, it 
suggests to apply simultaneous over-relaxation and under-relaxation to improve the 
convergence. Results of symmetric successive over/under-relaxation SOR (SSOUR) 
and parallel successive over/under-relaxation SOR (PSOUR) are compared in Table 
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Table 8.5 

ID parallel over /under-relaxation method vs. ID sequential SOR for grid resolution 41- 



method 


iteration 


Li-error 


ujl- optimum 


lur- optimum 


LRSOR 


51 


9.99298e-4 


1.86887 




RLSOR 


31 


9.38294e-4 




1.86637 


SSOUR 


58 


9.99474e-4 


0.24825 


1.87087 


PSOUR(2) 


33 


8.68232e-4 


1.84785 


1.91892 


PSOUR(4) 


57 


9.99032e-4 


0.17317 


1.87588 


PSOUR(6) 


57 


9.98090e-4 


0.11712 


1.87387 


PSOUR(8) 


58 


9.99746e-4 


0.09910 


1.87087 


PSOUR(IO) 


57 


9.97719c-4 


0.18118 


1.87287 


PSOUR(14) 


52 


9.99428e-4 


0.36837 


1.89590 


PSOUR(18) 


53 


9.98391e-4 


0.33233 


1.87988 



8.5 for grid resolution 41 (the optimal values of relaxation factors are searched in 
[0.0, 2.0] with accuracy l.e — 3). 

8.2. Two-dimensional numerical experiment. In our two-dimensional nu- 
merical experiments (8.1) is solved by row-wise ordering Gauss-Seidel (RGS) and SOR 
(RSOR), symmetric Gauss-Seidel (SGS) and SOR (SSOR), frontal sweeping Gauss- 
Seidel (FGS) and SOR (FSOR) and the presented parallel Gauss-Seidel (PGS(p x , p y ), 
where p x and p y are the number of sub-domains in x and y directions respectively) 
and SOR (PSOR^, p y )) on grid resolutions 51 x 51, 101 x 101 and 151 x 151. 

Since the topology of domain decomposition may changes the convergence rate of 
parallel solver, strip- wise and square-wise topologies are examined in this experiment 
(cf. figure 8.1). The strip- wise and square- wise domain decompositions have a p x 1 
and ^fp x yfp topologies respectively, where p is the number of sub-domains. 




2x2 3x3 4x4 5x5 



Fig. 8.1. Strip-wise (up) and square-wise (down) sub-domains topologies used in this study. 

Results of this experiment are summarized in Tables 8.6, 8.7 and 8.8. According 
to tables, the number of iterations to meet the convergence for the presented parallel 
method is close to that of sequential ones, and there is not sensible dependency to 
the topology of domain decomposition. Note that in this experiment the relaxation 
parameters for all directions are taken to be equal. 
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Table 8.6 

2D parallel Gauss-Seidel vs. 2D sequential Gauss-Seidel. 



grid 
method 


51 x 51 




101 x 101 




151 x 151 




iteration 


Li -error 


iteration 


L\ -error 


iteration 


Li-error 


RGS 


1018 


9.98560c-4 


4065 


9.99123e-4 


9139 


9.99798e-4 


SGS 


1006 


9.98720e-4 


4038 


9.99374e-4 


9097 


9.99960e-4 


FGS 


1006 


9.96308e-4 


4037 


9.99753e-4 


9097 


9.99686e-4 


PGS(4xl) 


1020 


9.97194e-4 


4066 


9.99808e-4 


9140 


9.99839e-4 


PGS(2x2) 


1020 


9.97652c-4 


4065 


9.99717c-4 


9138 


9.99888e-4 


PGS(9xl) 


1038 


9.97011e-4 


4103 


9.99232e-4 


9195 


9.99884e-4 


PGS(3x3) 


1029 


9.99224e-4 


4082 


9.99687c-4 


9163 


9.99873c-4 


PGS(25xl) 


1088 


9.96490e-4 


4219 


9.99572c-4 


9371 


9.99965e-4 


PGS(5x5) 


1049 


9.96392c-4 


4116 


9.99972e-4 


9213 


9.99814e-4 



Table 8.7 

2D parallel SOR vs. 2D sequential SOR. for u> = 1.25. 



grid 
method 


51 x 51 




101 x 101 




151 x 151 




iteration 


Ll-error 


iteration 


Ll-error 


iteration 


Li-crror 


RSOR 


616 


9.97982e-4 


2450 


9.99198c-4 


5501 


9.99339c-4 


SSOR 


606 


9.95519e-4 


2425 


9.98966e-4 


5461 


9.99332e-4 


FSOR 


605 


9.95328e-4 


2424 


9.98924e-4 


5460 


9.99310e-4 


PSOR(4xl) 


626 


9.95846e-4 


2467 


9.98854e-4 


5524 


9.99442e-4 


PSOR(2x2) 


626 


9.98955e-4 


2465 


9.99991e-4 


5521 


9.99318c-4 


PSOR(9xl) 


652 


9.96804e-4 


2520 


9.99830e-4 


5605 


9.99658e-4 


PSOR(3x3) 


637 


9.99291e-4 


2487 


9.99869e-4 


5556 


9.99914c-4 


PSOR(16xl) 


685 


9.94546e-4 


2593 


9.99467e-4 


5718 


9.99331c-4 


PSOR(4x4) 


648 


9.98578e-4 


2512 


9.99153e-4 


5590 


9.99347c-4 


PSOR(25xl) 


715 


9.94756e-4 


2688 


9.99302e-4 


5863 


9.99984e-4 


PSOR(5x5) 


662 


9.94443e-4 


2535 


9.99867c-4 


5624 


9.99977c-4 



Table 8.8 

2D parallel SOR vs. 2D sequential SOR. for u = 1.5. 



grid 
method 


51 x 51 




101 x 101 




151 x 151 




iteration 


Li-error 


iteration 


Li-crror 


iteration 


Li-crror 


RSOR 


348 


9.90645e-4 


1373 


9.98771e-4 


3074 


9.99890e-4 


SSOR 


341 


9.88632e-4 


1351 


9.99049e-4 


3038 


9.98968e-4 


FSOR 


339 


9.89902e-4 


1349 


9.99485e-4 


3036 


9.99170e-4 


PSOR(4xl) 


369 


9.97722e-4 


1415 


9.98874e-4 


3133 


9.99910c-4 


PSOR(2x2) 


369 


9.89711e-4 


1410 


9.98715c-4 


3127 


9.99310e-4 


PSOR(9xl) 


407 


9.97695e-4 


1498 


9.97574e-4 


3259 


9.99389e-4 


PSOR(3x3) 


382 


9.99520e-4 


1443 


9.97418c-4 


3179 


9.98991c-4 


PSOR(16xl) 


453 


9.94660e-4 


1606 


9.96612c-4 


3432 


9.99365e-4 


PSOR(4x4) 


396 


9.94275e-4 


1474 


9.99090e-4 


3227 


9.98820e-4 


PSOR(25xl) 


477 


9.90157e-4 


1736 


9.99730e-4 


3645 


9.99353e-4 


PSOR(5x5) 


407 


9.99724e-4 


1504 


9.98626e-4 


3274 


9.99382e-4 
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8.3. Thee-dimensional numerical experiment. In our three-dimensional nu- 
merical experiments (8.1) is solved by row-wise sweeping Gauss-Seidel (RGS) and SOR 
(RSOR), symmetric sweeping Gauss-Seidel (SGS) and SOR (SSOR), frontal sweep- 
ing Gauss-Seidel (FGS) and SOR (FSOR) and the presented parallel Gauss-Seidel 
(PGS(p x , p y , p z ), where p x , p y and p z are number of sub-domains in x, y and z di- 
rections respectively) and SOR (PSOR^, p y , p z )) on grid resolutions 25 x 25 x 25, 
51 x 51 x 51 and 101 x 101 x 101. Figure 8.2 shows the topologies of domain decom- 
positions used in this numerical experiment. The relaxation parameters are taken to 
be equal along all sweeping directions in this section. 

Table 8.9, 8.10 and 8.11 show results of this numerical experiment. Tables show 
that the presented parallel 3D solvers are convergent and number of iterations for 
achieving convergence are close to that of sequential solvers, in particular in some 
cases the parallel solver performs slightly better than the sequential solver. Like 
2D results, the convergence rate of 3D solver has a little dependency to topology of 
domain decomposition in this experiment. 

In contrast to previously suggested parallel SOR methods in literature our results 
are very promising, as we do not observe sensible convergence decay due to paral- 
lelization which is a pathological weakness of alternative parallel SOR methods. This 
observation suggest us to use the presented parallel SOR solver as a cache efficient 
SOR method which is disseised in the nest section. 




3x2x2 5x3x1 4x2x2 3x3x3 



Fig. 8.2. Topologies of domain decompositions in 3D numerical experiment of the present study. 
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Table 8.9 

3D parallel Gauss-Seidel vs. 3D sequential Gauss-Seidel. 



grid 
method 


25x25x25 




51x51x51 




101x101x101 




iteration 


Li -error 


iteration 


L\ -error 


iteration 


L\ -error 


RGS 


110 


9.92077e-3 


480 


9.96665e-3 


1921 


9.99432e-3 


SGS 


104 


9.95897e-3 


466 


9.99655e-3 


1893 


9.99775c-3 


FGS 


104 


9.86877c-3 


466 


9.97495e-3 


1893 


9.99243e-3 


PGS(3x3x3) 


105 


9.96386e-3 


468 


9.99384e-3 


1898 


9.99252e-3 


PGS(2x2xl) 


105 


9.99993e-3 


469 


9.98302e-3 


1898 


9.99899e-3 


PGS(7xlxl) 


107 


9.87250e-3 


472 


9.97408e-3 


1905 


9.99024e-3 


PGS(2x2x2) 


106 


9.93976e-3 


470 


9.99815c-3 


1901 


9.99735e-3 


PGS(llxlxl) 


108 


9.92317c-3 


475 


9.98559e-3 


1911 


9.99717c-3 


PGS(3x2x2) 


107 


9.92017c-3 


472 


9.96988e-3 


1904 


9.99496e-3 


PGS(5x3xl) 


108 


9.94702e-3 


474 


9.97328e-3 


1907 


9.99631e-3 


PGS(4x2x2) 


108 


9.83564e-3 


473 


9.97272e-3 


1906 


9.99269e-3 


PGS(3x3x3) 


109 


9.91718c-3 


475 


9.97940e-3 


1909 


9.99410e-3 



Table 8.10 

3D parallel SOR vs. 3D sequential SOR. for w = 1.25. 



grid 
method 


25x25x25 




51x51x51 




101x101x101 




iteration 


Li-error 


iteration 


Li-error 


iteration 


L\ -error 


RSOR 


69 


9.77818c-3 


293 


9.99745e-3 


1164 


9.99186e-3 


SSOR 


63 


9.92592e-3 


281 


9.93894e-3 


1137 


9.98696e-3 


FSOR 


62 


9.95549e-3 


280 


9.94525e-3 


1136 


9.98851e-3 


PGS(3x3x3) 


65 


9.75364e-3 


284 


9.97072e-3 


1144 


9.98524e-3 


PGS(2x2xl) 


65 


9.87121c-3 


284 


9.99866e-3 


1144 


9.99540e-3 


PGS(7xlxl) 


67 


9.85868e-3 


289 


9.95523e-3 


1154 


9.99055e-3 


PGS(2x2x2) 


66 


9.89078e-3 


287 


9.93992e-3 


1149 


9.98397e-3 


PGS(llxlxl) 


69 


9.73977c-3 


293 


9.99839e-3 


1164 


9.99148e-3 


PGS(3x2x2) 


67 


9.96697e-3 


288 


9.98010c-3 


1152 


9.99076c-3 


PGS(5x3xl) 


68 


9.81936e-3 


291 


9.95989e-3 


1157 


9.99021e-3 


PGS(4x2x2) 


68 


9.84237e-3 


290 


9.95890e-3 


1155 


9.99221e-3 


PGS(3x3x3) 


69 


9.90408c-3 


292 


9.99708e-3 


1159 


9.99217c-3 



Table 8.11 

3D parallel SOR vs. 3D sequential SOR. for u = 1.5. 



grid 
method 


25x25x25 




51x51x51 




101x101x101 




iteration 


Li -error 


iteration 


L\ -error 


iteration 


L\ -error 


RSOR 


41 


9.82565c-3 


169 


9.97880c-3 


659 


9.99566e-3 


SSOR 


36 


9.90569e-3 


157 


9.95554e-3 


633 


9.97890e-3 


FSOR 


35 


9.57206e-3 


155 


9.99694e-3 


631 


9.98964e-3 


PGS(3x3x3) 


39 


9.58539e-3 


162 


9.93716c-3 


643 


9.99297e-3 


PGS(2x2xl) 


38 


9.81012c-3 


162 


9.99555e-3 


644 


9.99660e-3 


PGS(7xlxl) 


41 


9.90818c-3 


170 


9.90995e-3 


659 


9.99539e-3 


PGS(2x2x2) 


40 


9.87981c-3 


166 


9.95387e-3 


651 


9.99802e-3 


PGS(llxlxl) 


44 


9.61852e-3 


178 


9.90053e-3 


675 


9.99266e-3 


PGS(3x2x2) 


41 


9.72949e-3 


169 


9.93594e-3 


655 


9.98725e-3 


PGS(5x3xl) 


42 


9.84760e-3 


172 


9.90824e-3 


662 


9.98893e-3 


PGS(4x2x2) 


42 


9.66169e-3 


170 


9.93732e-3 


660 


9.97495e-3 


PGS(3x3x3) 


43 


9.73555c-3 


173 


9.92163e-3 


664 


9.99070e-3 
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8.4. Cache-efficient SOR relaxation. The growing speed gap between pro- 
cessor and memory has led to the development of hierarchical memories and utility 
of using caches in modern processors [22]. Nowadays the speed of a code depends in- 
creasingly on how well the cache structure is exploited in the course of computation. 
The number of cache misses is recently an important factor parallel to the number 
of floating point operations (FLOPS), during comparison of numerical algorithms. 
Unfortunately, it is not easy to a-priori estimate number of cache misses, but it is 
easy to perform a posteriori analysis [9]. 

When a program references a memory location, the data in the referenced location 
and nearby locations are brought into a cache level. Any additional references to data 
before these are evicted from the cache will be one or two orders of magnitude faster 
than references to main memory. So increasing the data locality leads to decreasing 
cache miss and so the computational performance. A program is said to have a 
good data locality if, most of the time, the computer finds the data referenced by the 
program in its cache. The locality of a program can be improved by changing the order 
of computation and/or the assignment of data to memory locations so that references 
to the same or nearby locations occur near to each other during the program execution 
[24]. Interested readers are refereed to [9, 12, 22, 24] for further details about this 
topic. 

Based on the above discussion, each parallel algorithm based on the domain de- 
composition concept is potentially a cache efficient algorithm. The presented method 
in this study decomposes the global data into smaller sub-domains (with more local- 
ity), and after getting essential information from neighbor sub-domains at start of 
each iteration, performs computations in each sub-domain independently. In this sec- 
tion we shall study the performance of presented parallel SOR from the viewpoint of 
cache efficiency. For this purpose the 3D model problem is solved with the presented 
parallel Gauss-Seidel method (oj = 1.0) on grid resolutions 25 x 25 x 25, 51 x 51 x 51, 
101 x 101 x 101 and 151 x 151 x 151 (a single processor is used in this experiment). 

Let's the define the efficiency-factor as the ratio of the total CPU time for cache 
efficient solver to that of classic sequential solver. Figure 8.3 shows variation of 
efficiency-factor vs. number of sub-domains in this experiment. The plot shows that 
when the size of global data is sufficiently large (the cache-miss is susceptible), the 
efficiency-factor is increased with increasing the number of sub-domains (data local- 
ity). Note that with increasing the number of sub-domain the number of FLOPS is 
slightly increased, but due to mentioned effect solver performs superior. 

9. Closing remarks. In the present study, a new method is developed to par- 
allelize sequential sweeping procedure on structured grids, and is used to parallelize 
the SOR (and ILU) preconditioning method on structured grids. This method can be 
considered as a special overlapping domain decomposition which uses a fully parallel 
multi-frontal sweeping strategy with local coupling at sub-domains's interfaces. The 
implementation of method in one and two spatial dimensions is discussed in details 
and its extension to higher dimension is briefly outlined. The application of method 
then extended to general structured n-diagonal matrices. Using notion of alternatively 
block upper-lower triangular matrices, the convergence theory is established in gen- 
eral. Our numerical show that the convergence rate of the presented method is close 
to that of sequential solver. It is indeed a very promising result does not reported 
for previous parallel versions of SOR to our knowledge (in fact our method does not 
show sensible convergence rate decay due to parallilization) . This result suggest us 
to use this solver as cache efficient SOR method. Numerical results on this issue also 
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Fig. 8.3. Parallel SOU as a cache-efficient solver: The variation of efficiency factor vs. number 
of sub-domains for 3D model problem on grid resolutions 25 3 , 51 3 , 101 3 and 151 3 . 



support effectiveness of this idea. 

ft is worth mentioning that our strategy can be directly sued to parallelize other 
sequential procedures on structured grid. For example in the supplementary material 
to this work (see: [26]), we use the same strategy to parallelize the fast sweeping 
method to solve the eikonal equation. The efficiency of that method in some cases 
(complex geometries) are significantly better than the original fast sweeping as well as 
classic fast marching method. Moreover in [25], we introduce a fully explicit and un- 
conditionally stable parallel method to solve nonlinear transient heat equation. This 
method is based on parallelizing multi-dimensional version of the Saul'yev scheme 
using the presented marching algorithm in this study. Finally, we believe that the 
presented strategy can be extended to semi-structured grids (like quadtree/octree 
grids). At the moment, we think that under certain assumptions such an extension 
is straightforward. Lets to outline this idea on quadtrees. We denote by Cartesian 
graded quadtree a grid which is resulted from an isotropic hierarchical graded refine- 
ment of a square-shaped root cell. Now consider quadtree grid Q. Assume there is an 
orientation preserving deformation T which maps Q into a graded Cartesian quadtree 
denoted by Q. Moreover, assume that there is a square-shaped decomposition 2 2? on 
Q such that every sub-domain is still a graded Cartesian quadtree. Then it is possible 
to apply the presented parallel strategy on T>. In this case the sweeping directions 
within each sub-domain are determined based on Z-ordcr space filing cures (cf. [6]) 
started from sub-domain's corners. 

Acknowledgements. We would like to thanks triple referees for their construc- 
tive comments which significantly improve the contents of this paper. 



2 a decomposition with ^/p X ^/p topology, where p is number of sub-domains. 
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